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Abstract — We consider continuous-time sparse stocliastic pro- 
cesses from which we have only a finite number of noisy/noiseless 
samples. Our goal is to estimate the noiseless samples (denoising) 
and the signal in-between (interpolation problem). By relying 
on tools from the theory of splines, we derive the joint a 
priori distribution of the samples and show how this probability 
density function can be factorized. The factorization enables us 
to tractably implement the maximum a posteriori and minimum 
mean-square error (MMSE) criteria as two statistical approaches 
for estimating the unknowns. We compare the derived statistical 
methods with well-knovra techniques for the recovery of sparse 
signals, such as the £i norm and Log (£i-io relaxation) regular- 
ization methods. The simulation results show that, under certain 
conditions, the performance of the regularization techniques can 
be very close to that of the MMSE estimator. 

Index Terms — Denoising, Interpolation, Levy Process, MAP, 
MMSE, Statistical Learning, Sparse Process. 



I. Introduction 

THE recent popularity of regularization techniques in 
signal and image processing is motivated by the sparse 
nature of real-world data. It has resulted in the development 
of powerful tools for many problems such as denoising, de- 
convolution, and interpolation. The emergence of compressed 
sensing, which focuses on the recovery of sparse vectors from 
highly under-sampled sets of measurements, is playing a key 
role in this context [1], [2], [3]. 

Assume that the signal of interest {s[«]}i^o ^ finite-length 
discrete signal also represented by s as a vector) that has a 
sparse or almost sparse representation in some transform or 
analysis domain (e.g., wavelet or DCT). Assume moreover 
that we only have access to noisy measurements of the form 
{s[i] = s[i] + r?,[?.]}^_|-j, where {^'[*]}i-o denotes an additive 
white Gaussian noise. Then, we would hke to estimate {s[i]}i. 
The common sparsity-promoting variational techniques rely on 
penalizing the sparsity in the transform/analysis domain [4], 
[5] by imposing 

{«H}»™0 = ^gflin - + AJsparse(s)}, (1) 

where s is the vector of noisy measurements, Jsparsc(') is 
a penalty function that reflects the sparsity constraint in the 
transform/analysis domain and A is a weight that is usually 
set based on the noise and signal powers. The choice of 
•^sparse (■) = || ' is One of the favorite ones in compressed 
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sensing when {s[i]}™o is itself sparse [6], while the use of 
>/sparse(s) = TV{s), where TV stands for total variation, is a 
common choice for piecewise-smooth signals that have sparse 
derivatives [7]. 

Although the estimation problem for a given set of mea- 
surements is a deterministic procedure and can be handled 
without recourse to statistical tools, there are benefits in 
viewing the problem from the stochastic perspective. For 
instance, one can take advantage of side information about 
the unobserved data to estabhsh probability laws for all or 
part of the data. Moreover, a stochastic framework allows 
one to evaluate the performance of estimation techniques and 
argue about their distance from the optimal estimator. The 
conventional stochastic interpretation of the variational method 
in (1) leads to the finding that {s[i]}™o 's the maximum a 
posteriori (MAP) estimate of {s[i]}^o. In this interpretation, 
the quadratic data term is associated with the Gaussian nature 
of the additive noise, while the sparsifying penalty term 
corresponds to the a priori distribution of the sparse input. 
For example, the penalty Jsparsc( ) = II • ll^i is associated 
with the MAP estimator with Laplace prior [8], [9]. However, 
investigations of the compressible/sparse priors have revealed 
that the Laplace distribution cannot be considered as a sparse 
prior [10], [11], [12]. Recently in [13], it is argued that (1) is 
better interpreted as the minimum mean-square error (MMSE) 
estimator of a sparse prior. 

Though the discrete stochastic models are widely adopted 
for sparse signals, they only approximate the continuous nature 
of real- world signals. The main challenge for employing 
continuous models is to transpose the compressibility/sparsity 
concepts in the continuous domain while maintaining com- 
patibiUty with the discrete domain. In [14], an extended class 
of piecewise-smooth signals is proposed as a candidate for 
continuous stochastic sparse models. This class is closely 
related to signals with a finite rate of innovation [15]. Based 
on infinitely divisible distributions, a more general stochastic 
framework has been recently introduced in [16], [17]. There, 
the continuous models include Gaussian processes (such as 
Brownian motion), piecewise-polynomial signals, and a-stable 
processes as special cases. In addition, a large portion of the 
introduced family is considered as compressible/sparse with 
respect to the definition in [11] which is compatible with the 
discrete definition. 

In this paper, we investigate the estimation problem for 
the samples of the continuous -time sparse models introduced 
in [16], [17]. We derive the a priori and a posteriori proba- 
biUty density functions (pdf) of the noiseless/noisy samples. 
We present a practical factorization of the prior distribution 
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which enables us to perform statistical learning for denois- 
ing or interpolation problems. In particular, we implement 
the optimal MMSE estimator based on the message-passing 
algorithm. The implementation involves discretization and 
convolution of pdfs, and is in general, slower than the common 
variational techniques. We further compare the performance 
of the Bayesian and variational denoising methods. Among 
the variational methods, we consider quadratic, TV, and Log 
regularization techniques. Our results show that, by matching 
the regularizer to the statistics, one can almost replicate the 
MMSE performance. 

The rest of the paper is organized as follows: In Section 
II, we introduce our signal model which relies on the general 
formulation of sparse stochastic processes proposed in [16], 
[17J. In Section IV, we explain the techniques for obtaining 
the probability density functions and, we derive the estimation 
methods in Section III. We study the special case of Levy 
processes which is of interest in many applications in Section 
V, and present simulation results in Section VI. Section VII 
concludes the paper. 

II. Signal Model 

In this section, we adapt the general framework of [16] 
to the continuous-time stochastic model studied in this paper 
We follow the same notational conventions and write the 
input argument of the continuous-time signals/processes inside 
parenthesis (e.g., ,s(-)) while we employ brackets (e.g., ,s[-]) 
for discrete-time ones. Moreover, the tilde diacritic is used 
to indicate the noisy signal. Typically, s[-] represents discrete 
noisy samples. 

In Figure 1, we give a sketch of the model. The two 
main parts are the continuous-time innovation process and the 
linear operators. The process s(-) is generated by applying the 
shaping operator L~^ on the innovation process w. It can be 
whitened back by the inverse operator L. (Since the whitening 
operator is of greater importance, it is represented by L while 
refers to the shaping operator.) Furthermore, the discrete 
observations s[-] are formed by the noisy measurements of 
si-). 

The innovation process and the linear operators have distinct 
implications on the resultant process s. Our model is able 
to handle general innovation processes that may or may not 
induce sparsity/compressibility. The distinction between these 
two cases is identified by a function f{uj) that is called 
the Levy exponent, as will be discussed in Section II-A. 
The sparsity/compressibihty of s and, consequently, of the 
measurements s, is inherited from the innovations and is 
observed in a transform domain. This domain is tied to the 
operator L. In this paper, we deal with operators that we 
represent by all-pole differential systems, tuned by acting upon 
the poles. 

Although the model in Figure 1 is rather classical for Gaus- 
sian innovations, the investigation of non-Gaussian innovations 
is nonlrivial. While the transition from Gaussian to non- 
Gaussian necessitates the reinvestigation of every definition 
and result, it provides us with a more general class of stochas- 
tic processes which includes compressible/sparse signals. 



A. Innovation Process 

Of all white processes, the Gaussian innovation is un- 
doubtedly the one that has been investigated most thoroughly. 
However, it represents only a tiny fraction of the large family 
of white processes, which is best explored by using Gelfand's 
theory of generalized random processes. In his approach, 
unUke with the conventional point-wise definition, the stochas- 
tic process is characterized through inner products with test 
functions. For this purpose, one first chooses a function space 
£ of test functions (e.g., the Schwartz class <S of smooth and 
rapidly decaying functions). Then, one considers the random 
variable given by the inner product {w,(p), where w represents 
the innovation process and ip € £ [18]. 

Definition 1: A stochastic process is called an iimovation 
process if 

1) it is stationary, i.e., the random variables {w,(pi) and 
{w, 992) are identically distributed, provided (p2 is a 
shifted version of (pi, and 

2) it is white in the sense that the random variables (w, ipi) 
and {u!,(f2) are independent, provided (pi,(p2 € £ are 
non-overlapping test functions (i.e., <pi<p2 = 0). 

The characteristic form of w{-) is defined as 

y<fG£: =E{e-j<'"^'^>}, (2) 

where E{ } represents the expected- value operator. The char- 
acteristic form is a powerful tool for investigating the prop- 
erties of random processes. For instance, it allows one to 
easily infer the probability density function of the random 
variable {w, (p), or the joint densities of {w, ipi) , . . . , {w, Lpn). 
Further details regarding characteristic forms can be found in 
Appendix A. 

The key point in Gelfand's theory is to consider the form 

^^(^)=exp(^^/(^(./;))d.T^ . (3) 

and to provide the necessary and sufficient conditions on /(w) 
(the Levy exponent) for w to define a generalized innovation 
process over S (dual of S). The class of admissible Levy 
exponents is characterized by the Levy-Khintchine represen- 
tation theorem [19], [20] as 

+ f (eJ-- - 1 - ja;al]_i,i[(a)) v{a) da, (4) 

JK\{0} 

where — 1 for a d B and otherwise, and v{-) (the 

Levy density) is a real-valued density function that satisfies 

/ min(l, a^)w(a) da < 00. (5) 

^R\{0} 

In this paper, we consider only synmietric real- valued Levy 
exponents (i.e., /i = and v{a) = v{—a)). Thus, the general 
form of (4) is reduced to 
2 p 

/(w) = — — -|- / (cos(aw) — 1) v{a) da. (6) 
2 Jr\{o} 

Next, we discuss three particular cases of (6) which are of 
special interest in this paper. 
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Fig. 1. Connection between the white noise w{x), the sparse continuous signal s(x), and the discrete measurements s[i 



1) Gaussian Innovation: The choice v 
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This shows that the random variable [v 
Gaussian distribution with variance a'^\\(p 

2) Impulsive Poisson: Let a = and v{a) = Apa(a), where 
Pa is a symmetric probability density function. The corre- 
sponding white process is known as the impulsive Poisson 
innovation. By substitution in (6), we obtain 



/ip(^^) 



= A / (cos(aa;) 
Jr\{o} 

= X{pa{^)-1), 



l)pa{a) da 



(9) 



where pa denotes the Fourier transform of Pa- Let l[o,i] 
represents the test function that takes 1 on [0, 1] and 
otherwise. Thus, if X ^ (wip,l[o i]), then for the pdf of 
X we know that (see Appendix A) 



pxix) = 



oo 

E 

oo 
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(x) 



4=1 



i times 



It is not hard to check (see Appendix II in [14] for a proof) that 
this distribution matches the one that we obtain by defining 



w{x) 



fcez 



akS{x - Xk), 



(11) 
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where S{-) stands for the Dirac distribution, {xk} 
a sequence of random Poisson points with parameter A, and 
{ak}k£Z is an independent and identically distributed (i.i.d.) 
sequence with probability density pa independent of {xk}- The 
sequence {xk}kez is a Poisson point random sequence with 
parameter A if, for all real values a < b < c < d, the random 
variables iVi — \{xk} H [a,b]\ and N2 = \{xk} H [c, o?]| are 
independent and iVi (or iV2) follows the Poisson distribution 
with mean A(6 — a) (or X{d — c)), which can be written as 



Prob{iVi = n} 



(12) 



In [14], this type of innovation is introduced as a potential 
candidate for sparse processes, since all the inner products 
have a mass probability at x = 0. 

3) Symmetric a-Stable: The stable laws are probability 
density functions that are closed under convolution. More 
precisely, the pdf of a random variable X is said to be stable 
if, for two independent and identical copies of X, namely, 
Xi,X2, and for each pair of scalars < ci,C2, there exists 
< c such that ciXi + C2X2 has the same pdf as cX. For 
stable laws, it is known that c" — c" +C2 for some < a < 2 
[21]; this is the reason why the law is indexed with a. An 
a-stable law which corresponds to a symmetric pdf is called 
symmetric a-stable. It is possible to define symmetric a-stable 
white processes for < a < 2 by considering a ~ and 
v{a) = j^TT^, where < Cq. From (6), we get 



cos{au!) 



/E\{0} 

Ca\OJ\ 



sin^ (I) 



da — 

dx = - 



if) 



-Ca\L0\°', 



where Cq is a positive constant. This yields 



da 
(13) 

(14) 



which confirms that every random variable of the form {w^ , (p) 
has a symmetric a-stable distribution [21]. The fat-tailed 
distributions including a-stables for < a < 2 are known 
to generate compressible sequences [11]. Meanwhile, the 
Gaussian distributions are also stable laws that correspond to 
the extreme value a = 2 and have classical and well-known 
properties that differ fundamentally from non-Gaussian laws. 

The key message of this section is that the innovation 
process is uniquely determined by its Levy exponent /(w). 
We shall explain in Section II-C how /(w) affects the sparsity 
and compressibility properties of the process s. 

B. Linear Operators 

The second important component of the model is the 
shaping operator (the inverse of the whitening operator L) 
that determines the correlation structure of the process. For the 
generalized stochastic definition of s in Figure 1, we expect 
to have 

(s,^) = (L-i«;,^) - (u;,L-iV), (15) 

where L^^* represents the adjoint operator of L~^. It shows 
that L^^ ought to define a valid test function for the 
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equalities in (15) to remain valid. In turn, this sets constraints 
on L~^. The simplest choice for would be that of an 
operator which forms a continuous map from S into itself, 
but the class of such operators is not rich enough to cover 
the desired models in this paper For this reason, we take 
advantage of a result in [16] that extends the choice of shaping 
operators to those L^^ operators for which L^^ forms a 
continuous mapping from S into Lp for some 1 < p. 

1) Valid Inverse Operator L~^: In the sequel, we first 
explain the general requirements on the inverse of a given 
whitening operator L. Then, we focus on a special class of 
operators L and study the implications for the associated 
shaping operators in more details. 

We assume L to be a given whitening operator, which may 
or may not be uniquely invertible. The minimum requirement 
on the shaping operator L^^ is that it should form a right- 
inverse of L (i.e., LL^^ = I, where I is the identity operator). 
Furthermore, since the adjoint operator is required in (15), 
L^^ needs to be linear. This implies the existence of a kernel 
h{x, t) such that 



^(i-t)— h L-i{-} ^Ld,T{-} 



I3l,t{x-t) 



h(x, T)w{T)dT. 



(16) 



Linear shift-invariant shaping operators are special cases that 
correspond to h{x, r) = h{x — t). However, some of the L^^ 
operators considered in this paper are not shift-invariant. 

We require the kernel h to satisfy the following three 
conditions: 

(i) Lh{x, t) — 6{x — t), where L acts on the parameter x 
and 5 is the Dirac function, 

(ii) h{x, t) = 0, for r > max(0, x), 

(iii) (1 + |r|P-i) ^^^da; is bounded for all p > 1. 
Condition (i) is equivalent to LL^^ = I, while (iii) is a 
sufficient condition studied in [16] to establish the continuity 
of the mapping L^^ : 5 Lp, for all p. Condition (ii) 
is a constraint that we impose in this paper to simplify the 
statistical analysis. For a; > 0, its implication is that the 
random variable s{x) — L^^w(x) is fully determined by w{t) 
with T < X, or, equivalently, it is independent of w(t) for 

T > X. 

From now on, we focus on differential operators L of the 
form X)"=o ^«1-'*' where D is the first-order derivative (g^), 
D° is the identity operator (I), and \i are constants. With 
Gaussian innovations, these operators generate the autoregres- 
sive processes. An equivalent representation of L, which helps 
us in the analysis, is its decomposition into first-order factors 
as L = -^rinr=i(l-' ^ The scalars are the roots of 
the characteristic polynomial and correspond to the poles of 
the inverse linear system. Here, we assume that all the poles 
are in the left half-plane SKr^ < 0. This assumption helps us 
associate the operator L to a suitable kernel h, as shown in 
Appendix B. 

Every differential operator L has a unique causal Green 
function pL [22]. The linear shift-invariant system defined by 
h(x,T) = pi^{x — t) satisfies conditions (i)-(ii). If all the poles 
strictly lie in the left half-plane (i.e., < 0), due to absolute 
integrability of pL (stability of the system), h{x, r) satisfies 
condition (iii) as well. The definition of L^^ given through the 



(a) 



w{x) 



ut{x) 



(b) 



Fig. 2. (a) Linear shift-invariant operator L^j ^ and its impulse response 
/3l,t (L-spline). (b) Definition of the auxiliary signal ut{x). 



kernels in Appendix B achieves both linearity and stability, 
while loosing shift-invariance when L contains poles on the 
imaginary axis. It is worth mentioning that the application of 
two different right-inverses of L on a given input produces 
results that differ only by an exponential polynomial that is in 
the null space of L. 

2) Discretization: Apart from qualifying as whitening op- 
erators, differential operators have other appealing properties 
such as the existence of finite-length discrete counterparts. 
To explain this concept, let us first consider the first-order 
continuous-time derivative operator D that is associated with 
the finite-difference filter H{z) = 1 — z^^. This discrete 
counterpart is of finite length (FIR filter). Further, for any 
right inverse of D such as D^^, the system is shift 

invariant and its impulse response is compactly supported. 
Here, Dd.T is the discretized operator corresponding to the 
sampling period T with impulse response (^S{-) — 5{- — T)). 
It should be emphasized that the discrete counterpart H{z) 
is a discrete-domain operator, while the discretized operator 
acts on continuous-domain signals. It is easy to check that 
this impulse response coincides with the causal B-spline of 
degree (l[o,i[). In general, the discrete counterpart of L = 
A„ nr=i(^ ~ ''i^) defined through its factors. Each D — r^I 
is associated with its discrete counterpart Hi{z) = 1 — c''*z^^ 
and a discretized operator given by the impulse response 
5{-) — c^^^ 5{- — T). The convolution of n such impulse 
responses gives rise to the impulse response of l^d.T (up to 
the scaling factor A„), which is the discretized operator of L 
for the sampling period T. By expanding the convolution, we 
obtain the form X^ILo dT[k]5{' — kT) for the impulse response 
of Ijd.T- It is now evident that l^d.T corresponds to an FIR filter 
of length {n + 1) represented by {dT[fc]}fc=o with dT[0] ^ 0. 
Results in spline theory confirm that, for any right inverse 
L^^ of L, the operator L^^tL^^ is shift invariant and the 
support of its impulse response is contained in [Q,nT) [23]. 
The compactly supported impulse response of tL^, which 
we denote by /3l,t(')' is usually referred to as the L-spline. 
We define the generalized finite differences by 

ut{x) = (/3l,t * w){x) = (L^^tL"^ w){x) 

n 

= {Ld,Ts){x)^Y.'^T[k]s{x-kT). (17) 

k=0 

We show in Figures 2 (a), (b) the definitions of P-l,t{x) 
and ut{x), respectively. 
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C. Sparsity/Compressibility 

The innovation process can be thought of as a concatenation 
of independent atoms. A consequence of this independence 
is that the process contains no redundancies. Therefore, it is 
incompressible under unique representation constraint. 

In our framework, the role of the shaping operator L^^ is 
to generate a specific correlation structure in s by mixing the 
atoms of w. Conversely, the whitening operator L undoes the 
mixing and returns an incompressible set of data, in the sense 
that it maximally compresses the data. For a discretization of 
s corresponding to a sampling period T, the operator Ijd^T 
mimics the role of L. It efficiently uncouples the sequence 
of samples and produces the generalized differences ut, 
where each term depends only on a finite number of other 
terms. Thus, the role of ljd,T can be compared to that of 
converting discrete-time signals into their transform domain 
representation. As we explain now, the sparsity/compressibility 
properties of ut are closely related to the Levy exponent / 
of w. 

The concept is best explained by focusing on a special 
class known as Levy processes that correspond to /3L,T(a;) = 
^[o,T\{x) (see Section V for more details). By using (3) 
and (53), we can check that the characteristic function of 
the random variable ut is given by e^^'^'^\ When the Levy 
function / is generated through a nonzero density v{a) that 
is absolutely integrable (i.e., impulsive Poisson), the pdf of 
Ut associated with e^t{<^) necessarily contains a mass at the 
origin [20] (Theorems 4.18 and 4.20). This is interpreted as 
sparsity when considering a finite number of measurements. 

It is shown in [11], [12] that the compressibility of the 
measurements depends on the tail of their pdf. In simple terms, 
if the pdf decays at most inverse-polynomially, then, it is 
compressible in some corresponding Lp norm. The interesting 
point is that the inverse-polynomial decay of an infinite- 
divisible pdf is equivalent to the inverse-polynomial decay 
of its Levy density w(-) with the same order [20] (Theorem 
7.3). Therefore, an innovation process with a fat-tailed Levy 
density results in processes with compressible measurements. 
This indicates that the slower the decay rate of v{-), the more 
compressible the measurements of s. 

D. Summary of the Parameters of the Model 

We now briefly review the degrees of freedom in the 
model and how the dependent variables are determined. The 
innovation process is uniquely determined by the Levy triplet 
{lJ,,a,v). The sparsity/compressibility of the process can be 
detemiined through the Levy density v (see Section II-C). The 
values n and or, equivalently, the poles {rj}"^i of 

the system, serve as the free parameters for the whitening 
operator L. As explained in Section II-B2, the taps of the FIR 
filter drii] are determined by the poles. 

III. Prior Distribution 

To derive statistical estimation methods such as MAP and 
MMSE, we need the a priori distributions. In this section, by 
using the generalized differences in (17), we obtain the prior 
distribution and factorize it efficiently. This factorization is 



fundamental, since it makes the implementation of the MMSE 
and MAP estimators tractable. 

The general problem studied in this paper is to estimate 
s{x) at {x = «^}^_Q for arbitrary rUg G N* (values of 
the continuous process s at a uniform sampling grid with 
T = ^), given a finite number {m + 1) of noisy/noiseless 
measurements {s[fc]}fcLo °f *(^) the integers. Although 
we are aware that this is not equivalent to estimating the 
continuous process, by increasing rus we are able to make the 
estimation grid as fine as desired. For piecewise-continuous 
signals, the limit process of refining the grid can give us access 
to the properties of the continuous domain. 

To simplify the notations let us define 

f ST\i] = s(a;)U=iT, ^g. 

\ UT[i] = UT{x)\x=iT, 

where ut{x) is defined in (17). Our goal is to derive the joint 
distribution of st[*] (a priori distribution). However, the st[*] 
are in general pairwise-dependent, which makes it difficult 
to deal with the joint distribution in high dimensions. This 
corresponds to a large number of samples. Meanwhile, as 
will be shown in Lemma 1, the sequence {wrHli forms a 
Markov chain of order (n — 1) that helps in factorizing the 
joint probability distributions, whereas {sT[«]}i does not. The 
leading idea of this work is then that each UT[i\ depends on a 
finite number of urli], j 7^ i- It then becomes much simpler 
to derive the joint distribution of {uT[«]}j and link it to that 
of {sT[i]}i. Lemma 1 helps us to factorize the joint pdf of 
{uT[i\}i- 

Lemma 1: For N > n and i>0, where n is the differential 
order of L, 

1) the random variables are identically distributed, 

2) the sequence {wT[*]}i is a Markov chain of order n—1, 
and 

3) the sample UT[i+N] is statistically independent of uxli] 
and ST[i]- 

Proof. First note that 

UT[i] = {h,T*w){x%=^T = (w, /3L,T(iT--)). (19) 

Since f3i^_T{iT — •) functions are shifts of the same function 
for various i and w is stationary (Definition 1), {uT[«]}i are 
identicaUy distributed. 

Recalling that /3l,t(') is supported on [0,nT), we know 
that pL,T{iT - •) and /3l,t((« + N)T - ■) have no common 
support for N > n. Thus, due to the whiteness of w c.f 
Definition 1), the random variables {w , Ph.TiiT — •)) and 
(w' ) PL,T{{'i + — •)) are independent. Consequently, we 
can write 

Pu{uT[i + n] I urii + n- l],UT[i + n - 2], . . .) 
= Pu{uT[i + Ti] \ UT[i + n-l],...,UT[i + l]), (20) 

which confirms that the sequence {MT[«]}i forms a Markov 
chain of order (n — 1). Note that the choice of TV > n is due 
to the support of /3l,t- If the support of /3l,t was infinite, it 
would be impossible to find j such that ut[«] and utU] were 
independent in the strict sense. 
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To prove the second independence property, we recall that 

ST[i] = s{x)\^^^^ = f h{lT,T)wiT)dT = {w,h{iT,-)). (21) 



Condition (ii) in Section II-Bl implies that h{iT,T) = for 
r > max(0,iT). Hence, h{iT,T) and l3L,T{{i+N)T--) have 
disjoint supports. Again due to whiteness of w, this imphes 
that utIi + N] and St[*] are independent. ■ 

We now exploit the properties of ut [i] to obtain the a priori 
distribution of .s^ [v']. Theorem 1, which is proved in Appendix 
C summarizes the main result of Section IIL 

Theorem 1: Using the conventions of Lemma 1, for k > 
2n — 1 we have 



Ps{sT[k], 

k 

n 

e=2n-l 



,.,st[0]) = 

\dT[0\\pu(uT[9] 



xp,(sT[2n-2],...,ST[0]). 



(22) 



In the definition of proposed in Section II-B, except 
when all the poles are strictly included in the left half-plane, 
the operator fails to be shift-invariant. Consequently, nei- 
ther s(x) nor st[{\ are stationary. An interesting consequence 
of Theorem 1 is that it relates the probability distribution of the 
non- stationary process st[«] to that of the stationary process 
UT\i] plus a minimal set of transient terms. 

Next, we show in Theorem 2 how the conditional probability 
of UT[i\ can be obtained from a characteristic form. To 
maintain the flow of the paper, the proof is postponed to 
Appendix D. 

Theorem 2: The probabiUty density function of ut[^] con- 
ditioned on (n — 1) previous is given by 

Vu(utW\ I {«T[^-i]}rJi') = 



(23) 



where 



/r U ( YTi=l w,/3L,T(a; - iT)) dx. 
IV. Signal Estimation 



(24) 



MMSE and MAP estimation are two common statistical 
paradigms for signal recovery. Since the optimal MMSE 
estimator is rarely realizable in practice, MAP is often used 
as the next best thing. In this section, in addition to applying 
the two methods to the proposed class of signals, we settle the 
question of knowing when MAP is a good approximation of 
MMSE. 

For the estimation purpose it is convenient to assume that 
the sampling instances associated with s[i] are included in 
the uniform sampling grid for which we want to estimate the 
values of s. In other words, we assume that T = — = — , 
where T is the samphng period in the definition of st\\ and 
riT is a positive integer. This assumption does impose no lower 



bound on the resolution of the grid because we can set T 
arbitrarily close to zero by increasing tit- 

To simphfy mathematical formulations, we use vectorial 
notations. We indicate the vector of noisy/noiseless measure- 
ments t>y §■ The vector st stands for the hypothetical 
realization {stW\Y}^=q of the process on the considered grid, 
and sj,,ir denotes the subset {sT[inT]}™o that corresponds 
to the points at which we have a sample. Finally, we represent 
the vector of estimated values {sT[fc]}^cf by sr. 

A. MMSE Denoising 

It is very common to evaluate the quaUty of an estimation 
method by means of the mean-square error, or SNR. In this 
regard, the best estimator, known as MMSE, is obtained by 
evaluating the posterior mean, or st = ]E{st | §}. 

For Gaussian processes, this expectation is easy to obtain, 
since it is equivalent to the best linear estimator [24]. However, 
there are only a few non-Gaussian cases where an exphcit 
closed form of this expectation is known. In particular, if the 
additive noise is white and Gaussian (no restriction on the 
distribution of the signal) and pure denoising is desired (T = 
1), then the MMSE estimator can be reformulated as 



SmMSE = S + (T^VlogPs(x)l 



(25) 



where smmse stands for st,mmse with T = 1, and tr^ is the 
variance of the noise [25], [26], [27]. Note that the pdf ps, 
which is the result of convolving the a priori distribution 
with the Gaussian pdf of the noise, is both continuous and 
differentiable. 



B. MAP Estimator 

Searching for the MAP estimator amounts to finding the 
maximizer of the distribution p{st | s). It is commonplace to 
reformulate this conditional probability in terms of the a priori 
distribution. 

The additive discrete noise n is white and Gaussian with 
the variance cr^. Thus, 



p(st I s) = p{s I St) 



Ps{st) 



^||s-ST,nTll2 



Ps{st) 



P~s{s) 



■ (26) 



In MAP estimation, we are looking for a vector st that 
maximizes the conditional a posteriori probability, so that (26) 
leads to 



st,map = argmax p(st I s) 



^l|s-ST,„.rl|^ 



arg max 

St 



arg max e 

St 



Ps{st) 



(2770-2) "2 p5(s) 

4-||s-St,„tII2 



Ps{st). (27) 



The last equality is due to the fact that neither (2770-2 
nor Ps{s) depend on the choice of st- Therefore, they play 
no role in the maximization. 
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If the pdf of St is bounded, the cost function (27) can be 
replaced with its logarithm without changing the maximizer. 
The equivalent logarithmic maximization problem is given by 

st.map = argniin \\st,„j. - s||2 - 2(tI logpJsT}- (28) 

St 

By using the pdf factorization provided by Theorem 1, (28) is 
further simpUfied to 



St, MAP 



argmm < ||st,„t 

St ' 



Sl|2 



-20-" 51 l0gP„(wT[fc]|{«T[fc-i]}"^J 



k=2n-l 



-2al logp, (sT[2n - 2], . . . , st[0]) |, (29) 

where each wtH is provided by the Unear combi- 
nation of the elements in st found in (17). If we 
denote the term (^—logpu{uT[k]\{uT[k — i]}"^^)^ by 

{ut [k], . . . ,UT[k — n + 1]), the MAP estimation becomes 
equivalent to the minimization problem 



St MAP 



arg mm 

St 



jST.riT 



s||2 



-l-A ^T{uT[k],...,UT[k-n + l]) 

fc=2n-l 

-Alogp,(sT[2n-2],...,ST[0])|, (30) 

where A = 2(7^. The interesting aspect is that the MAP estima- 
tor has the same form as (1) where the sparsity-promoting term 
in the cost function is determined by both and the 
distribution of the innovation. The well-known and successful 
TV regularizer corresponds to the special case where 4't(-) is 
the univariate function | • | and the FIR filter drl] is the finite- 
difference operator. In Appendix E, we show the existence of 
an irmovation process for which the MAP estimation coincides 
with the TV regularization. 

C. MAP vs MMSE 

To have a rough comparison of MAP and MMSE, it is 
beneficial to reformulate the MMSE estimator in (25) as a 
variational problem similar to (30), thereby, expressing the 
MMSE solution as the minimizer of a cost function that 
consists of a quadratic term and a sparsity-promoting penalty 
term. In fact, for sparse priors, it is shown in [13| that the 
minimizer of a cost function involving the £i-norm penalty 
term approximates the MMSE estimator more accurately than 
the commonly considered MAP estimator. Here, we propose 
a different interpretation without going into technical details. 
From (25), it is clear that 

smmse = s + cr^V log P5(smmse - bj), (31) 

where bg = cr^V logps(s). We can check that smmse in (31) 



sets the gradient of the cost J(s) 
b|) to zero. It suggests that 



s — s 



2<logPs(s- 



which is similar to (28). The latter result is only valid when 
the cost function has a unique minimizer. Similarly to [13], 
it is possible to show that, under some mild conditions, this 
constraint is fulfilled. Nevertheless, for the qualitative compar- 
ison of MAP and MMSE, we only focus on the local properties 
of the cost functions that are involved. The main distinction 
between the cost functions in (32) and (28) is the required pdf. 
For MAP, we need Ps, which was shown to be factorizable by 
the introduction of generalized finite differences. For MMSE, 
we require pg. Recall that ps is the result of convolving ps with 
a Gaussian pdf. Thus, irrespective of the discontinuities of Ps, 
the function pg is smooth. However, the latter is no longer 
separable, which complicates the minimization task. The other 
difference is the offset term hg in the MMSE cost function. 
For heavy-tail irmovations such as a-stables, the convolution 
by the Gaussian pdf of the noise does not greatly affect ps. 
In such cases, ps can be approximated by Ps fairly well, 
indicating that the MAP estimator suffers from a bias (bg). 
The effect of convolving ps with a Gaussian pdf becomes more 
evident as ps decays faster. In the extreme case where Ps is 
Gaussian, ps is also Gaussian (convolution of two Gaussians) 
with a different mean (which introduces another type of bias). 
The fact that MAP and MMSE estimators are equivalent 
for Gaussian irmovations indicates that the two biases act in 
opposite directions and cancel out each other. In summary, for 
super-exponentially decaying innovations, MAP seems to be 
consistent with MMSE. For heavy-tail innovations, however, 
the MAP estimator is a biased version of the MMSE, where 
the effect of the bias is observable at high noise powers. The 
scenario in which MAP diverges most from MMSE might be 
the exponentially decaying innovations, where we have both a 
mismatch and a bias in the cost function, as will be confirmed 
in the experimental part of the paper. 

V. Example: Levy Process 

To demonstrate the results and impUcations of the theory, 
we consider Levy processes as special cases of the model in 
Figure 1. Levy processes are roughly defined as processes with 
stationary and independent increments that start from zero. The 
processes are compatible with our model by setting ^ = 
(i.e., n = 1 and ri = 0), L~-^ = J^, or h{x,T) = l[o,oo[{x — 
''')^l[o.oo[(^''") which imposes the boundary condition s(0) = 

w{T)dT = 0. The corresponding discrete FIR filter has the 
two taps dT[0] = 1 and drl^] = — 1. The impulse response of 
L^^tL"^ is given by 



I3l,t{x) = u{x) - u{x - T) 



1, < .T < T 
0, otherwise. 



(33) 



Smmse = argmin ||s - s||2 - 2o-^ logps(s - bj). (32) 



A. MMSE Interpolation 

A classical problem is to interpolate the signal using noise- 
less samples. This corresponds to the estimation of st[^] 
where ut \ (tit does not divide 6) by assuming s[i\ = 
sriiriT] (0 < i < m). Although there is no noise in this 
scenario, we can still employ the MMSE criterion to estimate 
st[0]. We show that the MMSE interpolator of a Levy process 
is the simple linear interpolator, irrespective of the type of 
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innovation. To prove this, we assume Iht < < {I + l)nT 
and rewrite st[^] as 

e 

ST[e] = st[M+ Y1 fT[k] - ST[k - 1] . (34) 

ut [kJ 

This enables us to compute 

st[0] = E{sT[e] I {sTiiriTmo} 



= STUriTi 



-e{ Y1 UT[k]\{sT[knT]}iy35) 



Since the mapping from the set {sT[*?^T]}i to {st[0]} U 
{ui[i] = ST[(i + 1)^^T] — ST['inT]}i is one to one, the two 
sets can be used interchangeably for evaluating the conditional 
expectation. Thus, 



st[0] = stIItit] + Yl ^{uT[k]\{sT[0]}U{ui[i\}i}. 

(36) 



According to Lemma 1, UT[k] (for fc > 0) is independent 
of st[0] and urik ], where ^ . By rewriting ui[i] as 



^(^ + l)nT 

=inT + 

UT[k] unless inr + 1 <k <{i + l)nT. Hence, 



we conclude that u-\_[i] is independent of 



ST[e\ = ST[lnT]+ e{wt[A;] Li[/ + 1]}. (37) 



Since ui[l + 1] = Z]l!^n"+i"^M {"T[«]}i is a se- 
quence of i.i.d. random variables, the expected mean of 
UT\i] conditioned on ui[l + 1] is the same for all i with 
IriT + 1 <i < {I + l)nT, which yields 



e{ut[A;]|ui[Z + 1]} = ^ e{ut[«]|«i[/ + 1]} 

L[i + 1]} 



1] 



riT 



By applying (38) to (37), we obtain 

ST[e] = st[M + ^— + 



(38) 



-IriT 



(1 - Xe) stIIut] + Xe st[{1 + l)nT], (39) 
. Obviously, (39) indicates a linear interpo- 



where A^i 
lation between the samples. 

B. MAP Denoising 

Since n = 1, the finite differences are independent. 
Therefore, the conditional probabilities involved in Theorem 
1 can be replaced with the simple pdf values 

k 

P,(st[A;],...,st[0]) =Ps{sT[0\)X{pu{uT[e\). (40) 



In addition, since fw (0) = 0, from Theorem 2, we have 



Pu{ut[0\) 



(41) 



In the case of impulsive Poisson innovations, as shown in 
(10), the pdf of UT[i] has a single mass probability at a; = 0. 
Hence, the MAP estimator will choose UT[i] = for all i, 
resulting in a constant signal. In other words, according to the 
MAP criterion and due to the boundary condition .s(0) = 0, the 
optimal estimate is nothing but the trivial all-zero function. For 
the other types of innovations where the pdf of the increments 
ut[{\ is bounded, or, equivalently, when the Levy density u(-) 
is singular at the origin [20], one can reformulate the MAP 
estimation in the form of (30) as 



st[0] = 



and 



arg min | ^ {S\i] - ST[mT])^ 
-|-A*t(st[1]) 

niTlT 

+X Y *T(sT[fc] -ST[k- 1])}(42) 

fe=2 

where ^'t(-) = — logptt(-) and A = 2(t^. Because shifting 
the function '^t with a fixed additive scalar does not change 
the minimizer of (42), we can modify the function to pass 
through the origin (i.e., ^t{0) = 0). After having applied this 
modification, the function ^t=i presents itself as shown in 
Figure 3 for various innovation processes such as 

1) Gaussian innovation: a = 1, and v{a) = 0, which 
implies 



Pu{x) = 



2w 



2) Laplace-type innovation: cr = 0, v{a) = - — |^ 
which imphes (see Appendix E) 



(43) 



•la\ 



(44) 



The Levy process of this innovation is known as the 
variance gamma process [28]. 
3) Cauchy iimovation (a-stable with a = 1): a = 0, v{a) = 

, which imphes 



Pu{x) 



c + 87ra;2 



(45) 



The parameters of the above innovations are set such that 
they all lead to the same entropy value \A1. 
The negative log-likelihoods of the first two innovation types 
resemble the £2 and £1 regularization terms. However, the 
curve of for the Cauchy innovation shows a nonconvex 
log-type function. 



14p 
12 \ 
10- 



Gaussian 
- Laplace 
- Cauchy 




Fig. 3. The ^t(') = ~ logPu{') functions at T = 1 for Gaussian, Laplace, 
and Cauchy distributions. The parameters of each pdf are tuned such that they 
all have the same entropy and the curves are shifted to enforce them to pass 
through the origin. 



C. MMSE Denoising 

As discussed in Section IV-C, the MMSE estimator, either 
in the expectation form or as a minimization problem, is 
not separable with respect to the inputs. This is usually a 
critical restriction in high dimensions. Fortunately, due to the 
factorization of the joint a priori distribution, we can lift 
this restriction by employing the powerful message-passing 
algorithm. The method consists of representing the statistical 
dependencies between the parameters as a graph and finding 
the marginal distributions by iteratively passing the estimated 
pdfs along the edges [29]. The transmitted messages along 
the edges are also known as beliefs, which give rise to the 
alternative name of belief propagation. In general, the marginal 
distributions (beliefs) are continuous-domain objects. Hence, 
for computer simulations we need to discretize them. 

In order to define a graphical model for a given joint 
probability distribution, we need to define two types of nodes: 
variable nodes that represent the input arguments for the joint 
pdf and factor nodes that portray each one of the terms in 
the factorization of the joint pdf The edges in the graph are 
drawn only between nodes of different type and indicate the 
contribution of an input argument to a given factor 

For the Levy process, we consider the joint conditional pdf 
p({sT[fc]}fc I factorized as 



(46) 



where Z = ({s[i]}™ j^) is a normalization constant that 
depends only on the noisy measurements and Q is the Gaussian 
function defined as 

g {x-a^) = —^e"^ . (47) 



V27r 

Note that, by definition, we have st[0] = 0. 

For illustration purposes, we consider the special case of 
pure denoising corresponding to T = 1. We give in Figure 
4 the bipartite graph G — (V, F, E) associated to the joint 
pdf (46). The variable nodes V — {1, . . . ., m} depicted in the 
middle of the graph stand for the input arguments {sT[fc]}feLi- 
The factor nodes F — {!,..., 2to} are placed at the right and 




prior 
distribution 



conditional 
probability 



S(.§[2|-.,i[2];^,y 
S(.§[3|-.5i[3];a^) 
Q(s\m] - si[m] ; ct,^,) 

noise 
distribution 



Fig. 4. Factor graph for the MMSE denoising of a Levy process. There are 
m variable nodes (circles) and 2m factor nodes (squares). 



left sides of the variable nodes depending on whether they 
represent the Gaussian factors or the Pu{') factors, respectively. 
The set of edges E — {(j,a) S V x F} also indicates a 
participation of the variable nodes in the corresponding factor 
nodes. 

The message-passing algorithm consists of initializing the 
nodes of the graph with proper ID functions and updat- 
ing these functions through communications over the graph. 
It is desired that we eventually obtain the marginal pdf 
p(si[fc] I on the /cth variable node, which enables 

us to obtain the mean. The details of the messages sent over 
the edges and updating rules are given in [30], [31]. 

VI. Simulation Results 

For the experiments, we consider the denoising of Levy 
processes for various types of innovation, including those 
introduced in Section II-A and the Laplace-type innovation 
discussed in Appendix E. Among the heavy-tail a-stable 
innovations, we choose the Cauchy distribution corresponding 
to a = 1. The four implemented denoising methods are 
1) Linear minimum mean-square error (LMMSE) method 
or quadratic regularization (also known as smoothing 
spline [32]) defined as 



arg mm 



g„2 



\\l+\l^{s[i\-s[i-l 



i=l 



(48) 



2) 



where A should be optimized. For finding the optimum A 
for given innovation statistics and a given additive-noise 
variance, we search for the best A for each realization by 
comparing the results with the oracle estimator provided 
by the noiseless signal. Then, we average A over a num- 
ber of realizations to obtain a unified and realization- 
independent value. This procedure is repeated each time 
the statistics (either the innovation or the additive noise) 
change. For Gaussian processes, the LMMSE method 
coincides with both the MAP and MMSE estimators. 
Total-variation regularization represented as 



ars; mm 



{« 



111+ A 



1=1 



111}. 



(49) 



where A should be optimized. The optimization process 
for A is similar to the one explained for the LMMSE 
method. 
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-MMSE = LMMSI 
-Log 




Fig. 5. SNR improvement vs. variance of the additive noise for Gaussian 
innovations. Tlie denoising metliods are: MMSE estimator (wliich is equiv- 
alent to MAP and LMMSE estimators liere), Log regularization, and TV 
regularization. 



Fig. 6. SNR improvement vs. variance of tlie additive noise for Gaussian 
compound Poisson innovations. Tlie denoising methods are: MMSE estimator. 
Log regularization, TV regularization, and LMMSE estimator. 



3) Logarithmic (Log) regularization described by 

argrnin {lis - s|||^ + a£ log (l + M^f^^) (50) 

where A should be optimized. The optimization pro- 
cess is similar to the one explained for the LMMSE 
method. In our experiments, we keep e = 1 fixed 
throughout the minimization steps (e.g., in the gradient- 
descent iterations). Unfortunately, Log is not necessarily 
convex, which might result in a nonconvex cost function. 
Hence, it is possible that gradient-descent methods get 
trapped in local minima rather than the desired global 
minimum. For heavy-tail innovations (e.g., a-stables), 
the Log regularizer is either the exact, or a very good 
approximation of, the MAP estimator. 

4) Minimum mean-square error denoiser which is imple- 
mented using the message-passing technique discussed 
in Section V-C. 

The experiments are conducted in MATLAB. We have 
developed a graphical user interface that facilitates the pro- 
cedures of generating samples of the stochastic process and 
denoising them using MMSE or the variational techniques. 

We show in Figure 5 the SNR improvement of a Gaus- 
sian process after denoising by the four methods. Since the 
LMMSE and MMSE methods are equivalent in the Gaussian 
case, only the MMSE curve obtained from the message- 
passing algorithm is plotted. As expected, the MMSE method 
outperforms the TV and Log regularization techniques. The 
counter intuitive observation is that Log, which includes a 
nonconvex penalty function, performs better than TV. Another 
advantage of the Log regularizer is that it is differentiable and 
quadratic around the origin. 

A similar scenario is repeated in Figure 6 for the compound- 
Poisson innovation with A = 0.6 and Gaussian amplitudes 
(zero-mean and a = 1). As mentioned in Section V-B, since 
the pdf of the increments contains a mass probability at a; = 
0, the MAP estimator selects the all-zero signal as the most 
probable choice. In Figure 6, this trivial estimator is excluded 
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Fig. 7. SNR improvement vs. variance of the additive noise for Cauchy (a- 
stable with a = 1) innovations. The denoising methods are: MMSE estimator. 
Log regularization (which is equivalent to MAP here), TV regularization, and 
LMMSE estimator. 



from the comparison. It can be observed that the performance 
of the MMSE denoiser, which is considered to be the gold 
standard, is very close to that of the TV regularization method 
at low noise powers where the source sparsity dictates the 
structure. This is consistent with what was predicted in [13]. 
Meanwhile, it performs almost as well as the LMMSE method 
at large noise powers. There, the additive Gaussian noise is the 
dominant term and the statistics of the noisy signal is mostly 
determined by the Gaussian constituent, which is matched to 
the LMMSE method. Excluding the MMSE method, none of 
the other three outperforms another one for the entire range 
of noise. 

Heavy-tail distributions such as a-stables produce sparse 
or compressible sequences. With high probability, their real- 
izations consist of a few large peaks and many insignificant 
samples. Since the convolution of a heavy-tail pdf with a Gaus- 
sian pdf is still heavy-tail, the noisy signal looks sparse even 
at large noise powers. The poor performance of the LMMSE 
method observed in Figure 7 for Cauchy distributions confirms 
this characteristic. The pdf of the Cauchy distribution, given 
by ^(i]^2) , is in fact the symmetric a-stable distribution with 
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Fig. 8. SNR improvement vs. variance of the additive noise for Laplace-type 
innovations. Tlie denoising metliods are: MMSE estimator. Log regularization, 
TV regularization (which is equivalent to MAP here), and LMMSE estimator. 

a = 1. The Log regularizer corresponds to the MAP estimator 
of this distribution while there is no direct link between the 
TV regularizer and the MAP or MMSE criteria. The SNR 
improvement curves in Figure 7 indicate that the MMSE 
and Log (MAP) denoisers for this sparse process perform 
similarly (specially at small noise powers) and outperform the 
corresponding £i-norm regularizer (TV). 

In the final scenario, we consider innovations with = 
and via) = ^-r-r- This results in finite differences obtained 

V ^ \a\ 

at T = 1 that follow a Laplace distribution (see Appendix E). 
Since the MAP denoiser for this process coincides with TV 
regularization, sometimes the Laplace distribution has been 
considered to be a sparse prior However, it is proved in [11], 
[12] that the realizations of a sequence with Laplace prior 
are not compressible, almost surely. The curves presented in 
Figure 8 show that TV is a good approximation of the MMSE 
method only in light-noise conditions. For moderate to large 
noise, the LMMSE method is better than TV. 

VII. Conclusion 

In this paper, we studied continuous-time stochastic pro- 
cesses where the process is defined by applying a linear 
operator on a white innovation process. For specific types 
of innovation, the procedure results in sparse processes. We 
derived a factorization of the joint posterior distribution for the 
noisy samples of the broad family ruled by fixed-coefficient 
stochastic differential equations. The factorization allows us to 
efficiently apply statistical estimation tools. A consequence of 
our pdf factorization is that it gives us access to the MMSE 
estimator. It can then be used as a gold standard for evaluating 
the performance of regularization techniques. This enables 
us to replace the MMSE method with a more-tractable and 
computationally efficient regularization technique matched to 
the problem without compromising the performance. We then 
focused on Levy processes as a special case for which we 
studied the denoising and interpolation problems using MAP 
and MMSE methods. We also compared these methods with 
the popular regularization techniques for the recovery of sparse 
signals, including the £i norm (e.g., TV regularizer) and the 



Log regularization approaches. Simulation results showed that 
we can almost achieve the MMSE performance by tuning 
the regularization technique to the type of innovation and 
the power of the noise. We have also developed a graphical 
user interface in MATLAB which generates realizations of 
stochastic processes with various types of innovation and 
allows the user to apply either the MMSE or variational 
methods to denoise the samples'. 

Appendix A 
Characteristic Forms 

In Gelfand's theory of generalized random processes, the 
process is defined through its inner product with a space of 
test functions, rather than point values. For a random process 
w and an arbitrary test function cp chosen from a given space, 
the characteristic form is defined as the characteristic function 
of the random variable X — {w,(p) and given by 

= E{c-'><'"^'P^} = [ px{x)e'>''dx 
Jr 

= :F,{px {x)}il). (51) 

As an example, let be a normalized white Gaussian noise 
and let ip be an arbitrary function in L2(M). It is well-known 
that {w^ ,(p) is a zero-mean Gaussian random variable with 
variance Ht^sjll^- Thus, in this example we have that 

^^^(</?) =e-^ll'^lli2. (52) 

An interesting property of the characteristic forms is that 
they help determine the joint probability density functions for 
arbitrary finite dimensions as 

k 

= -i^x{px(x)}(a;i, . . . jWfe), (53) 

where {(pi}i are test functions and {ui} are scalars. Equation 
(53) shows that an inverse Fourier transform of the character- 
istic form can yield the desired pdf. Beside joint distributions, 
characteristic forms are useful for generating moments too: 

QniA hrifc ^ , 

-L «, i — l 

= (H)"^^ / x"^ ■ ■ ■ x'^''px{x-i, . . . ,Xk)dxi- ■ ■ dxk 

= (_j)"i + -+"fcE{a;7i •••4^'=}. (54) 

Note that the definition of random processes through char- 
acteristic forms includes the classical definition based on the 
point values by choosing Diracs as the test functions (if 
possible). 

Except for the stable processes, it is usually hard to find the 
distributions of linear transformations of a process. However, 
there exists a simple relation between the characteristic forms: 

'The GUI is available at http://bigwww.epfl.ch/amini/MATLAB_codes/ 
SSS_GUI.zip 
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Let w he a random process and define ^ = hw, where L is 
a linear operator. Also denote the adjoint of L by L*. Then, 
one can write 

^^{(f) = E{e-j<^"'''^>} =E{e-j<'"'^*'^>} 

= #„(L». (55) 

Now it is easy to extract the probability distribution of ^ from 
its characteristic form. 

Appendix B 

Specification of nXH-ORDER Shaping Kernels 

To show the existence of a kernel h for the nth-order 
differential operator L = A„ HiLiC^ ~ '''^)' define 

r e'-*(--)(l[o,oo[(a;-r) 
hi{x,T)=< - l[o,oo[(Si - t)), 3fJrj = 0, (56) 

[ e^'(^-^)l[o,oo[(a;-r), JRn < 0, 

where .t; are nonpositive fixed real numbers. It is not hard to 
check that hi satisfies the conditions (i)-(iii) for the operator 
Li — D — Tjl. Next, we combine hi to form a proper kernel 
for as 

-. « n n 

h{x,T) = — ]Jh,(T,+,,n)]Jdn\ . (57) 

By relying on the fact that the hi satisfy conditions (i)-(iii), 
it is possible to prove by induction that h also satisfies (i)- 
(iii). Here, we only provide the main idea for proving (i). We 
use the factorization L = A„Li • • • L„ and sequentially apply 
every on h. The starting point i — n yields 



as 



Lnh{x,T) 



n"=l hi{Ti+i,Ti)dTi 



Tn) 



/ S{X - Tn) 

— / J{hi{n+^,n)Jldn\i5?,) 



Thus, L„/i(a;, r) has the same form as h with n replaced by 
n — 1. By continuing the same procedure, we finally arrive at 
'Lihi{x,t), which is equal to 5{x — r). 

Appendix C 
Proof of Theorem 1 

For the sake of simpUcity in the notations, for ^ > n we 
define 

UT[e] 
UT[e-i] 



ut [n] 
St [n - 1] 
ST[n-2] 



st[0] 



(59) 



Since the UT[i\ are linear combinations of ST[i], the ((^ + 
1) X l) vector u[^] can be linearly expressed in terms of st[«] 



\x[e] = D(^+i)x(0+i) 



ST[e] 

st[0-1] 
st[0] 



(60) 



where D(e+i)x(6)+i) is an upper-triangular matrix defined by 

dxiQ] dT[l] ■■■ drln] ••• 
drPi ■■■ drin-l] drln] ■■■ 

••• drp] ••• dT[n] 







nx(e+l-n) 



(61) 



Since <iT[0] 7^ 0, none of the diagonal elements of the 
upper-triangular matrix D(g_|_i)x(e+i) is zero. Thus, the ma- 
trix is invertible because detD(e_|_i)x(ei+i) = (dTlO])^"*"^"". 
Therefore, we have that 



Ps,u{^[0\) = 



Ps{sT\ei....sTn) 



|rfT[0] 



e+l-rt 



(62) 



A direct consequence of Lennma 1 is that, for 9>2n—l, 
we obtain 

Ps,u{uT[e\ I U[^ - 1]) = Pu(uT[e] I {uT[e - (63) 

which, in conjunction with Bayes' rule, yields 



Ps,u{uT[e]\n{e-i]) 



= Pu{uT[e]\{uT[e-i])'lJ). (64) 



By multiplying equations of the form (64) for 9 = 2n 
1, . . . , fc, we get 



Ps,u(u[2n — 2]) 



= n Pu{uT[e]\{uT[0-i]Y-l){65) 



It is now easy to complete the proof by substituting the 
numerator and denominator of the left-hand side in (65) by 
the equivalent forms suggested by (62). 

Appendix D 
Proof of Theorem 2 

As developed in Appendix A, the characteristic form can 
be used to generate the joint probability density functions. To 
use (53), we need to represent ut[*] as inner-products with 
the white process. This is already available from (19). This 
yields 



Pu{{uT[e - i\}'!=o) = 
-{X}{M^t,^^f^^HOT-iT-.))]{{uT[e - i]}to)(66) 
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From (3), we have 



1=0 



= cJr /™ ( Eto '^^|3I.,T{0T~iT~x))dx 

Using (24), it is now easy to verify that 

The only part left to mention before completing the proof is 
that 



Appendix E 
When does TV regularization meet MAP? 

The TV-regularization technique is one of the successful 
methods in denoising. Since the TV penalty is separable with 
respect to first-order finite differences, its interpretation as a 
MAP estimator is valid only for a Levy process. Moreover, 
the MAP estimator of a Levy process coincides with TV 
regularization only if "i^rix) = — logPu{x) = 7|.t| +1], where 
7 and 77 are constants such that 7 > 0. This condition implies 
that pu is the Laplace pdf Pu{x) — ''I^L This pdf is a 
valid distribution for the first-order finite differences of the 
Levy process characterized by the innovation with /i — a = 
and v{a) = iL because 



|a| 



Thus, we can write 
d 



(ej"^"- 1)' 



-y\a\ 



-da 



( cos(cL;a) — l) - 



-7a 



-da. 



(70) 



duj 



-2 / sin(wa)e~'*'°da 
Jo 
~2a; 



(71) 



By integrating (71), we obtain that 



' log(7^ 



uj ) + 77, where 77 is a constant. The key point in finding this 
constant is the fact that fi{0) — 0, which results in fi{uj) = 



log : 



suggests that 



Now, for the sampling period T, Equation (41) 



(x) 



(72) 



V^2^-^r(r) ' 

where Kt{-) is the modified Bessel function of the second 
kind. The latter probability density function is known as 
symmetric variance-gamma or symm-gamma. It is not hard 
to check that we obtain the desired Laplace distribution for 
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Fig. 9. The function = — logpu for different values of T after enforcing 
the curves to pass through the origin by applying a shift. For T = 1. the 
density function pu follows a Laplace law. Therefore, the con'esponding 'I't 
is the absolute-value function. 



T — \. However, this value of T is the only one for which 
we observe this property. Should the sampling grid become 
finer or coarser, the MAP estimator would no longer coincide 
with TV regularization. We show in Figure 9 the shifted 
functions for various T values for the aforementioned 
innovation where 7=1. 
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